This analysis examines simulated income data for Hungary, focusing on the relationships between income and various demographic factors including age, location, occupation, and gender. The dataset is simulated to reflect realistic patterns while maintaining a manageable size for analysis.
The data was simulated by the data_simulation.R script.
The data is available in the hungarian_income_data.csv
file.
Important things to note about the generated data:
Only the 8 most populated cities of Hungary are taken into count weighted by their population. List of cities: Budapest, Debrecen, Szeged, Miskolc, Pécs, Győr, Szombathely, Eger.
Only the 10 most common occupations are taken into count weighted by their frequency in the workforce. List of occupations: Software Developer, Teacher, Doctor, Sales Representative, Engineer, Accountant, Nurse, Manager, Chef, Driver.
The age distribution is generated by a beta distribution with parameters \(\alpha = 2\) and \(\beta = 3\) and multiplied by 95 to put the end result in the desired range. The beta distribution with the aforementioned parameters skews the age distribution towards younger ages, which is more realistic.
There are three groups of people categorized by their age:
Underage: each person has a random age at which they start working between 14 and 24.
Working age: 19-67
Pension age: each person has a random retirement age between 60 and 75.
Under 18 people have no income.
Working age people have a regular income based on their age, occupation, city, and gender.
Pension age people have a pension based on their occupation and city.
All working age people are considered to be employed.
The income of a working age man is 20.000 HUF higher than the income of a working age woman in the same occupation, city, and age group.
library(forcats)
data <- data %>%
mutate(age_group = cut(age,
breaks = seq(0, 100, by = 2),
right = FALSE,
include.lowest = TRUE,
labels = seq(0, 98, by = 2)))
dem_pyramid <- data %>%
group_by(age_group, gender) %>%
summarise(count = n(), .groups = 'drop') %>%
mutate(count = ifelse(gender == "Male", -count, count))
ggplot(dem_pyramid, aes(x = age_group, y = count, fill = gender)) +
geom_bar(stat = "identity", width = 0.8, color = "black") +
scale_y_continuous(labels = abs, expand = expansion(mult = c(0.05, 0.05))) +
scale_fill_manual(values = c("Male" = "#00BFFF", "Female" = "#FF3B3B")) +
coord_flip() +
labs(title = "Population Pyramid of Simulated Hungarian Data",
x = "Age Group",
y = "Count",
fill = "Gender") +
custom_theme +
theme(legend.position = "top",
axis.text.y = element_text(size = 10, face = "bold"),
plot.margin = margin(t = 20, r = 20, b = 20, l = 20))We employ two steps to clean the data: inter quartile range (IQR) outlier detection, and clustering.
We use IQR outlier detection to reduce the noise of the data.
We employ K-Means clustering to find the three major demographics groups: unemployed/young, working age, retired.
After cleaning the data and finding the three demographic groups, we solely focus on the middle cluster, the working age people, as this study aims to analyze the income of the Hungarian population and it would be nonsensical to analyze unemployed people or retired people as if their pension was a salary.
detect_outliers <- function(x) {
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
return(x < lower_bound | x > upper_bound)
}
outliers <- detect_outliers(data$income)
data_clean <- data[!outliers, ]
set.seed(42)
income_age_matrix <- data_clean %>%
select(income, age) %>%
scale()
kmeans_result <- kmeans(income_age_matrix, centers = 3, nstart = 25)
data_clean$cluster <- kmeans_result$cluster
cluster_summary <- data_clean %>%
group_by(cluster) %>%
summarise(
mean_income = mean(income),
.groups = 'drop'
) %>%
arrange(mean_income)
cluster_labels <- c("Working Age", "Pension Age", "Unemployed/Young")
data_clean$income_group <- factor(data_clean$cluster,
labels = cluster_labels[order(cluster_summary$mean_income)])
ggplot(data_clean, aes(x = age, y = income, color = income_group)) +
geom_point(alpha = 0.5) +
scale_color_viridis_d() +
labs(title = "Age vs Income by Cluster",
x = "Age",
y = "Income (HUF)",
color = "Income Group") +
custom_themesummary_stats <- summary(data)
kable(summary_stats, caption = "Summary Statistics of the Dataset (Outliers Removed)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| age | city | occupation | gender | income | starting_age | retirement_age | age_group | cluster | income_group | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. :14.00 | Length:7375 | Length:7375 | Length:7375 | Min. :379808 | Min. :14.00 | Min. :60.00 | 30 : 397 | Min. :1 | Working Age :7375 | |
| 1st Qu.:29.00 | Class :character | Class :character | Class :character | 1st Qu.:529234 | 1st Qu.:18.00 | 1st Qu.:65.00 | 32 : 377 | 1st Qu.:1 | Pension Age : 0 | |
| Median :39.00 | Mode :character | Mode :character | Mode :character | Median :579982 | Median :19.00 | Median :67.00 | 34 : 377 | Median :1 | Unemployed/Young: 0 | |
| Mean :40.14 | NA | NA | NA | Mean :582750 | Mean :18.89 | Mean :67.07 | 28 : 369 | Mean :1 | NA | |
| 3rd Qu.:50.00 | NA | NA | NA | 3rd Qu.:632598 | 3rd Qu.:20.00 | 3rd Qu.:69.00 | 42 : 365 | 3rd Qu.:1 | NA | |
| Max. :73.00 | NA | NA | NA | Max. :864318 | Max. :24.00 | Max. :75.00 | 40 : 360 | Max. :1 | NA | |
| NA | NA | NA | NA | NA | NA | NA | (Other):5130 | NA | NA |
By the following income distribution plot, we can clearly see that on average a man has a higher income than a woman. This does not yet mean that given equal positions a man earns more money. However, it is indicative that we should further analyze this aspect of the data.
ggplot(data %>% filter(age >= 18), aes(x = income, fill = gender)) +
geom_density(alpha = 0.6) +
scale_fill_viridis_d() +
labs(title = "Income Distribution by Gender",
subtitle = "(working age only)",
x = "Income (HUF)",
y = "Density") +
custom_themeThe following plot shows how the income is distributed against the age. An important thing to note is that as a person ages, their income increase. However it plateaus after a point, moreover, it even decreases in certain cases.
ggplot(data, aes(x = age, y = income, color = gender)) +
geom_point(alpha = 0.1, width = 0.2) +
scale_color_viridis_d() +
labs(title = "Income Distribution by Age",
subtitle = "(working age only)",
x = "Age",
y = "Income (HUF)") +
custom_themeTo get a better grasp of how the income distribution is made up, we can split the data by occupation, giving us a new perspective into how certain occupation are more handsomely rewarded. We can see that Software Developers and Doctors have the highest income, compared to Sales Representatives who earn a lower income.
ggplot(data %>% filter(age >= 18), aes(x = reorder(occupation, income, FUN = median), y = income, color = occupation)) +
geom_boxplot(alpha = 0.7) +
geom_jitter(alpha = 0.1, width = 0.2) +
scale_color_viridis_d() +
coord_flip() +
labs(title = "Income Distribution by Occupation",
subtitle = "(working age only)",
x = "Occupation",
y = "Income (HUF)") +
custom_themeThe following plot, like the one before, split the data. However, now we are analyzing how the city in which the person works at contributes to their salary. It is hard not to notice that the average person working in the capital, Budapest, enjoys a higher income compared to other cities.
ggplot(data %>% filter(age >= 18), aes(x = reorder(city, income, FUN = median), y = income, fill = city)) +
geom_violin(alpha = 0.7) +
geom_boxplot(width = 0.2, alpha = 0.5) +
scale_fill_viridis_d() +
coord_flip() +
labs(title = "Income Distribution by City",
subtitle = "(working age only)",
x = "City",
y = "Income (HUF)") +
custom_themeincome_by_category <- data %>%
filter(age >= 18) %>%
group_by(occupation, city, gender) %>%
summarise(
mean_income = mean(income),
count = n(),
.groups = 'drop'
) %>%
arrange(desc(mean_income))
# heatmap
ggplot(income_by_category, aes(x = city, y = occupation, fill = mean_income)) +
geom_tile() +
scale_fill_viridis(name = "Mean Income (HUF)") +
facet_wrap(~gender) +
labs(title = "Mean Income by Occupation, City, and Gender",
subtitle = "(working age only)",
x = "City",
y = "Occupation") +
custom_theme +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
strip.text = element_text(face = "bold"))top_earners <- income_by_category %>%
arrange(desc(mean_income)) %>%
head(10)
kable(top_earners,
caption = "Top 10 Highest Earning Combinations",
digits = 0) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| occupation | city | gender | mean_income | count |
|---|---|---|---|---|
| Software Developer | Budapest | Male | 745832 | 99 |
| Software Developer | Budapest | Female | 722333 | 100 |
| Doctor | Budapest | Male | 720956 | 61 |
| Doctor | Budapest | Female | 696091 | 77 |
| Software Developer | Debrecen | Male | 695662 | 43 |
| Software Developer | Szeged | Male | 694999 | 30 |
| Manager | Budapest | Male | 692519 | 107 |
| Software Developer | Szeged | Female | 684845 | 43 |
| Engineer | Budapest | Male | 675140 | 133 |
| Software Developer | Debrecen | Female | 674727 | 40 |
Previously we have seen from a plot that a man on averages is payed more than a woman. To test this result we propose a two sample t-test (assuming that both samples come from a normal distribution), where our nulhypothesisis that the mean male income is equal to the mean female income. Before employing a two sample t-test we must first investigate if the variance of the two samples differ significantly, for such we use an F-test.
##
## F test to compare two variances
##
## data: income by gender
## F = 0.99942, num df = 3636, denom df = 3737, p-value = 0.986
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9369317 1.0661025
## sample estimates:
## ratio of variances
## 0.999418
Given the above results we accept that the variance of the two samples are equal, because we did not find significant proof to state that the variances differ. Thus, we proceed our investigation with the assumption that the variance of the distributions from which the two samples are drown from are equal.
# Test if there's a significant difference in income between genders
t_test_result <- t.test(income ~ gender, data = data, alternative="less", var.equal=TRUE)
print(t_test_result)##
## Two Sample t-test
##
## data: income by gender
## t = -10.655, df = 7373, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Female and group Male is less than 0
## 95 percent confidence interval:
## -Inf -15335.5
## sample estimates:
## mean in group Female mean in group Male
## 573558.0 591693.7
The above results show that the p-value is way below our target of \(0.05\), which means we can state with outmost certainty that we found significant proof that the mean income in the female group is less than the mean income in the male group.
# Test if there are significant differences in income between cities
city_anova <- aov(income ~ city, data = data)
print(summary(city_anova))## Df Sum Sq Mean Sq F value Pr(>F)
## city 7 1.356e+13 1.937e+12 539.9 <2e-16 ***
## Residuals 7367 2.643e+13 3.587e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Test if there are significant differences in income between occupations
occupation_anova <- aov(income ~ occupation, data = data)
print(summary(occupation_anova))## Df Sum Sq Mean Sq F value Pr(>F)
## occupation 9 1.545e+13 1.717e+12 515.5 <2e-16 ***
## Residuals 7365 2.453e+13 3.331e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data <- data %>%
mutate(income_category = cut(income,
breaks = quantile(income, probs = seq(0, 1, 0.25)),
labels = c("Low", "Medium-Low", "Medium-High", "High"),
include.lowest = TRUE))
# Chi-squared test for cities
city_chi <- chisq.test(table(data$city, data$income_category))
print(city_chi)##
## Pearson's Chi-squared test
##
## data: table(data$city, data$income_category)
## X-squared = 2624.1, df = 21, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: table(data$occupation, data$income_category)
## X-squared = 2736.5, df = 27, p-value < 2.2e-16
ggplot(data, aes(x = city, fill = income_category)) +
geom_bar(position = "fill") +
scale_fill_viridis_d() +
labs(title = "Income Distribution by City",
x = "City",
y = "Proportion",
fill = "Income Category") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
custom_themeggplot(data, aes(x = occupation, fill = income_category)) +
geom_bar(position = "fill") +
scale_fill_viridis_d() +
labs(title = "Income Distribution by Occupation",
x = "Occupation",
y = "Proportion",
fill = "Income Category") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
custom_themeThe chi-squared tests help us determine if there’s a significant relationship between: 1. Cities and income categories 2. Occupations and income categories
The tests will show if the distribution of income categories is independent of city/occupation or if there’s a significant association. The visualizations help us understand the nature of these relationships by showing the proportion of each income category within each city and occupation.
# Test if income distributions differ between genders
male_income <- data$income[data$gender == "Male"]
female_income <- data$income[data$gender == "Female"]
ks_test <- ks.test(male_income, female_income)
print(ks_test)##
## Asymptotic two-sample Kolmogorov-Smirnov test
##
## data: male_income and female_income
## D = 0.10969, p-value < 2.2e-16
## alternative hypothesis: two-sided
# Test if income distributions differ between cities
kruskal_test <- kruskal.test(income ~ city, data = data)
print(kruskal_test)##
## Kruskal-Wallis rank sum test
##
## data: income by city
## Kruskal-Wallis chi-squared = 2574.8, df = 7, p-value < 2.2e-16
ggplot(data %>% filter(age >= 18), aes(x = age, y = income, color = gender)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", se = TRUE) +
scale_color_viridis_d() +
labs(title = "Relationship between Age and Income",
subtitle = "(working age only)",
x = "Age",
y = "Income (HUF)") +
custom_theme# Convert categorical variables to factors
data_working_age <- data %>% filter(age >= 18)
data_working_age$city <- as.factor(data_working_age$city)
data_working_age$occupation <- as.factor(data_working_age$occupation)
data_working_age$gender <- as.factor(data_working_age$gender)
# Fit the model
model1 <- lm(income ~ age + I(age^2) + city + occupation + gender, data = data_working_age)
# Create a beautiful model summary table
model_summary <- summary(model1)
kable(tidy(model_summary), caption = "Multiple Linear Regression Results (Working Age Only)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 434779.5680 | 3042.221285 | 142.91517 | 0 |
| age | 8492.2241 | 146.411115 | 58.00259 | 0 |
| I(age^2) | -79.2278 | 1.733131 | -45.71369 | 0 |
| cityDebrecen | -50118.9464 | 885.873428 | -56.57574 | 0 |
| cityEger | -100673.3825 | 1223.767096 | -82.26515 | 0 |
| cityGyőr | -98817.6008 | 1136.347121 | -86.96075 | 0 |
| cityMiskolc | -99847.5434 | 1064.213618 | -93.82284 | 0 |
| cityPécs | -98754.3471 | 1121.335644 | -88.06850 | 0 |
| citySzeged | -49284.8644 | 997.322532 | -49.41718 | 0 |
| citySzombathely | -99991.5016 | 1348.646293 | -74.14212 | 0 |
| occupationChef | -42406.4096 | 1535.934166 | -27.60952 | 0 |
| occupationDoctor | 59742.5726 | 1554.207886 | 38.43924 | 0 |
| occupationDriver | -50964.5811 | 1535.962289 | -33.18088 | 0 |
| occupationEngineer | 18051.4651 | 1251.405236 | 14.42496 | 0 |
| occupationManager | 37826.6478 | 1351.616466 | 27.98623 | 0 |
| occupationNurse | -32257.7329 | 1187.823213 | -27.15702 | 0 |
| occupationSales Representative | -60650.2152 | 1057.574345 | -57.34842 | 0 |
| occupationSoftware Developer | 88380.8010 | 1335.055549 | 66.20009 | 0 |
| occupationTeacher | -22511.0613 | 1128.935083 | -19.94008 | 0 |
| genderMale | 18870.6304 | 582.757427 | 32.38162 | 0 |
# Enhanced model diagnostics
par(mfrow = c(2,2))
plot(model1, col = income_palette[1], pch = 19, cex = 0.7)# Fit polynomial regression
model2 <- lm(income ~ poly(age, 3), data = data_working_age)
# Create a static plot
ggplot(data_working_age, aes(x = age, y = income)) +
geom_point(alpha = 0.1, color = income_palette[1]) +
geom_smooth(method = "lm", formula = y ~ poly(x, 3),
color = income_palette[5], fill = income_palette[5], alpha = 0.2) +
labs(title = "Polynomial Regression: Age vs Income",
subtitle = "Cubic polynomial fit with confidence interval (Working Age Only)",
x = "Age",
y = "Income (HUF)") +
custom_theme# Model comparison
model_comparison <- data.frame(
Model = c("Multiple Linear", "Polynomial"),
R_squared = c(summary(model1)$r.squared, summary(model2)$r.squared),
Adj_R_squared = c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared)
)
kable(model_comparison, caption = "Model Comparison (Working Age Only)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)| Model | R_squared | Adj_R_squared |
|---|---|---|
| Multiple Linear | 0.8851681 | 0.8848691 |
| Polynomial | 0.1541084 | 0.1537614 |
What is the predicted income of a 35 years old male software developer working in Budapest?
# Create a sample prediction
new_data <- data.frame(
age = 35,
city = "Budapest",
occupation = "Software Developer",
gender = "Male"
)
# Predict income
prediction <- predict(model1, newdata = new_data, interval = "prediction")
print(prediction[1])## [1] 742204.8
The analysis of the simulated Hungarian income data reveals several interesting patterns:
The analysis demonstrates the complex interplay between various factors affecting income levels in Hungary. The findings suggest that both demographic factors (age, gender) and professional factors (occupation, location) play important roles in determining income levels.